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Abstract. Because of the high cost of commercial genotyping chip tech- 
nologies, many investigations have used a two-stage design for genome- 
wide association studies, using part of the sample for an initial discovery 
of "promising" SNPs at a less stringent significance level and the re- 
mainder in a joint analysis of just these SNPs using custom genotyping. 
Typical cost savings of about 50% are possible with this design to ob- 
tain comparable levels of overall type I error and power by using about 
half the sample for stage I and carrying about 0.1% of SNPs forward to 
the second stage, the optimal design depending primarily upon the ra- 
tio of costs per genotype for stages I and II. However, with the rapidly 
declining costs of the commercial panels, the generally low observed 
ORs of current studies, and many studies aiming to test multiple hy- 
potheses and multiple endpoints, many investigators are abandoning 
the two-stage design in favor of simply genotyping all available sub- 
jects using a standard high-density panel. Concern is sometimes raised 
about the absence of a "replication" panel in this approach, as required 
by some high-profile journals, but it must be appreciated that the two- 
stage design is not a discovery/replication design but simply a more 
efficient design for discovery using a joint analysis of the data from 
both stages. Once a subset of highly-significant associations has been 
discovered, a truly independent "exact replication" study is needed in a 
similar population of the same promising SNPs using similar methods. 
This can then be followed by (1) "generalizability" studies to assess 
the full scope of replicated associations across different races, different 
endpoints, different interactions, etc.; (2) fine-mapping or resequencing 
to try to identify the causal variant; and (3) experimental studies of 
the biological function of these genes. Multistage sampling designs may 
be more useful at this stage, say, for selecting subsets of subjects for 
deep resequencing of regions identified in the GWAS. 
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1. INTRODUCTION 

Many of the genome-wide association studies 
(GWAS) currently underway or already reported have 
used some form of multistage sampling design (Sa- 
tagopan et al., 2002) because of the considerable sav- 
ings in geno typing costs this approach offers. In Sec- 
tion 2 we provide an overview of the basic approach 
to designing such studies, touching on such topics 
as the trade-offs between sample size and marker 
density, the selection of markers to carry forward 
to the second stage, methods of significance test- 
ing, the use of DNA pooling, and multistage designs 
for testing gene-environment (G x E) and gene-gene 
(G x G) interactions. Section 3 considers the general 
question of whether two-stage designs are still nec- 
essary in an era of declining costs and multipurpose 
studies. Finally, Section 4 discusses what should be 
done after a completed GWAS, including replica- 
tion, fine mapping, generalizability and functional 
studies, and revisits the utility of multistage designs 
in this context. 

2. BASIC PRINCIPLES OF TWO-STAGE 
STUDY DESIGN FOR GWAS 

Two-phase case-control designs were introduced 
to the epidemiologic literature by White (1982) and 
have been extensively developed in a series of pa- 
pers by Breslow and various colleagues (for a general 
overview of this literature, see Breslow and Chat- 
terjee, 1999). The basic idea of these designs is to 
use information available on all subjects in the main 
study to draw a more informative subsample for ad- 
ditional, more expensive, measurements, combining 
the information from both phases in the analysis. 

Two-stage sampling for GWAS, as introduced by 
gopan et al. (2002), is quite different, based 
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on genotyping part of the sample using a commer- 
cial high-density panel (typically 300,000 to a mil- 
lion SNPs) and then genotyping the most promising 
SNPs using a customized panel on the remainder 
of the sample. A final analysis combining the infor- 
mation from both samples is more powerful than 
treating the design as a hypothesis generation fol- 
lowed by independent replication (Skol et ah, 2006; 
Yu et ah, 2007) because it exploits the additional 
information about how significant the first stage as- 
sociations were, not just the fact that they exceeded 
some threshold. Formally, two stage designs can be 
conceptualized as a family of group sequential tests 
(one per SNP) with allowance for early stopping for 
"futility" (Jennison and Turnbull, 2000). 

Optimization of the design is usually framed as 
choosing the significance levels and the allocation of 
samples between the two stages in such a manner 
as to minimize the total cost while attaining the 
desired genome- wide significance level and power 
(Kraft, 2006; Kraft and Cox, 2008; Muller, Pahl 
and Schafer, 2007; Saito and Kamatani, 2002; Sa- 
tagopan and Elston, 2003; Satagopan, Venkatraman 
and Begg, 2004; Service, Sandkuijl and Freimer, 2003; 
Skol et ah, 2007; Wang et ah, 2006); alternatively, 
one might wish to maximize power subject to total 
sample size, genotype cost and type I error, or to 
minimize the total sample size subject to the other 
factors. These optimal designs are insensitive to the 
genetic model (mode of inheritance, relative risk and 
allele frequencies) and determined primarily by the 
total number of markers to be genotyped in stage I, 
the relative cost per genotype at stages I and II, the 
total available sample size, and whether (and how 
many) additional flanking markers will be tested 
around those selected from stage I. As an example 
of a cost minimization, the optimal design for a cost 
ratio of about 17.5 with 500K markers being tested 
in stage I and no additional SNPs being tested at 
stage II turns out to involve testing 30% of the sam- 
ple in stage I at a significance level of 0.0037 (i.e., 
about 1850 markers tested in stage II) and a signifi- 
cance level for the joint analysis of 1.6 x 10 -7 (Wang 
et ah, 2006); in this case, about 87% of the total cost 
goes to stage I genotyping, but the total cost is only 
40% that of a comparably powered one-stage design. 

Several authors (Eberle et ah, 2007; Gail et ah, 
2008; Nannya et ah, 2007) have investigated the 
power of GWAS, either for a single-stage or the 
first-stage of a multistage scan, and generally con- 
cluded that sample sizes of 1000 cases and 1000 con- 
trols were sufficient to detect associations in the 
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range of 1.7-2.0, smaller relative risks (e.g., 1.2— 
1.3) requiring much larger sample sizes. In general, 
minimum-cost two-stage designs can require consid- 
erably larger sample sizes than single-stage designs 
to achieve the same power. However, large costs re- 
ductions can still be achieved with a "nearly op- 
timal" two-stage design using a sample size only 
slightly larger than a one-stage design (Wang et al., 
2006). 

2.1 Trade-off Between Sample Size and 
Marker Density 

A crucial decision to be made is the choice of 
genotyping platform for stage I. At this writing, two 
companies — Affymetrix and Illumina — offer 
platforms ranging from 300K to 1M SNPs. The pan- 
els differ in the way SNPs were selected and, hence, 
their coverage (r 2 ) of the remaining common 
HapMap SNPs, as well as in their laboratory per- 
formance (call rates, reproducibility, etc.). Because 
coverage of SNPs is highly variable across the genome 
and the relationship between power and r 2 is nonlin- 
ear, the average power to detect an association with 
a random SNP is smaller than the power based on 
the average r 2 (Jorgenson and Witte, 2006). Instead, 
one must average the power for a given noncentral- 
ity parameter A at a putative causal locus across 
the distribution of r 2 s. The consequence is that one 
cannot simply add additional sample size to cover 
regions with poor coverage! For single stage studies, 
average power is maximized by choosing the plat- 
form with the best coverage on which it is afford- 
able to genotype all available samples. Comparisons 
of recent platforms tend to show that when geno- 
typing budget is limiting, sacrificing sample size for 
the higher density platform is not usually appro- 
priate (Hao, Schadt and Storey, 2008; Lewinger et 
al., 2007b; Nannya et al, 2007). A two-stage de- 
sign, however, can alter the sample-size vs. coverage 
trade-off in favor of higher density platforms in the 
first stage by allowing the use of all the available 
samples at a lower cost. Imputation, on the other 
hand, reduces the differences between SNP panels, 
making the lower cost, lower density platforms more 
attractive (Anderson et al., 2008). See also Barrett 
and Cardon (2006), de Bakker et al. (2005) and Pe'er 
et al. (2006) for further discussion. 

2.2 Design Complications 

2.2.1 Additional markers Additional markers 
flanking some or all hits might also be added to bet- 
ter characterize the full range of genetic variation 



in the region (Saito and Kamatani, 2002; Wang et 
al., 2006). With 5 additional markers being tested 
for each hit, the optimal design for the situation 
discussed above raises the first stage sample size to 
49% and reduces significance levels to 0.0005 and 
0.5 x 10~ 7 respectively, so that 95% of the total 
cost goes to stage I genotyping. In these calcula- 
tions, it was assumed that the additional markers 
would be imputed for the first-stage sample, using 
methods described in Section 2.2.2 and more com- 
prehensively elsewhere in this issue (Su et al., 2009), 
but one could instead test them directly on the stage 
I samples first and then decide which ones to carry 
forward to stage II. Further work on optimization of 
such designs is still needed. 

While it seems intuitively appealing to also use the 
replication step for the purpose of fine mapping — 
that is, to see whether there is another marker in 
the region that shows even stronger evidence for 
association — the yield from doing so may be min- 
imal. Consider the three possible situations: (1) an 
associated marker is in perfect LD with a causal 
variant; (2) it is in weak LD with a causal variant; 
or (3) it is nowhere near the causal variant. Only 
in the second case would adding additional markers 
be of any help. Suppose the first stage sample has 
power 1 — (3\ to detect the first kind of association 
and 1 — /?2 for the second, and let 7r/% denote the 
prior probability of type k. Then the prior proba- 
bility that the association is of type 2 given p < a± 
is 

(1 -/3 2 )7T 2 

(1 - /3i)7ri + (1 - (3 2 )ir 2 + ai7r 3 ' 

Considering the coverage of current platforms, ir\ is 
probably larger than tt 2 and tr^ is certainly much 
larger, so most detected associations are likely to be 
of types 1 or 3 and additional markers will not help 
[Peter Kraft, personal communication]. 

Clarke et al. (2007) have shown theoretically and 
by simulation that the increased penalty for multiple 
comparisons can defeat any possible gains in power 
for replication. Nevertheless, the inclusion of addi- 
tional markers can be advantageous in regions of 
relatively high LD when the original signal is weak, 
such as in regions where the coverage by the origi- 
nal panel is poor, but then any new associations dis- 
covered in the "replication" stage would require yet 
further confirmation. In general, they recommend 
deferring fine mapping to a separate sample from 
that used for replication. Thus, fine mapping should 
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be reserved for the regions that are interesting in the 
combined stage I and stage II data, rather than in- 
corporated into stage II for all markers carried over 
from stage I. This keeps the multiple comparisons 
problem at a minimum whether or not a new (third) 
sample is used for fine mapping. 

2.2.2 Haplotypes, multimarker tests and imputa- 
tion of missing markers The commercial panels are 
designed to allow for testing not just the hundreds 
of thousands or a million SNPs on the panel, but 
also all the roughly 5M common variants in the hu- 
man genome they tag, including copy number vari- 
ants. This entails using some form of multimarker 
or haplotype-based approach to "impute" genotypes 
to all those variants that are not directly tested. 
Promising associations with imputed variants de- 
tected in the initial scan are then tested in either 
the original sample or the follow-up stage by di- 
rect genotyping. While at first blush it might seem 
that the multiple comparisons penalty for testing 
5M variants would offset the advantage of using a 
tag-SNP approach, the correlation between tests due 
to LD means that the "effective number of indepen- 
dent tests" is only about 1M in European-descent 
populations or 2M in African-descent populations 
(Pe'er et al., 2008). Four companion papers in this 
issue (Chatterjee et al., 2009; Su et al., 2009; God- 
dard et al., 2009; Zollner and Teslovich, 2009) ad- 
dress various aspects of this topic in greater detail. 

2.2.3 Family-based designs One compelling advan- 
tage of a two-stage approach may be the opportu- 
nity to exploit different study designs, in particu- 
lar, family- and population-based. For example, the 
Cancer Family Registries for breast and colorectal 
cancer are currently undertaking GWASs aimed at 
exploiting their unique resource, combining the two 
sampling schemes (Zheng et al., 2010). In the first 
stage, a population-based series of cases that is en- 
riched for a positive family history or young age at 
onset is compared with unrelated population con- 
trols; hits from this stage are then to be tested using 
family-based association tests (FBAT) in the second 
stage using sibling or cousin controls to weed out 
false positives due to population stratification (see 
the contribution by Astle and Balding, 2009 in this 
issue), and finally, in a third stage, combining the 
phenotypes of all relatives from extended pedigrees 
with all available genotypes in a conditional segre- 
gation analysis (Hopper et al., 1999). A different 
two-stage design uses between-family comparisons 



to select a subset of SNPs with high power to detect 
associations in an FBAT, and tests associations with 
this subset using within-family comparisons in the 
same data set (Van Steen et al., 2005). For further 
discussion of these various options, see the compan- 
ion paper in this issue by Laird and Lange (2009). 

If instead of a FBAT design, some form of ge- 
nomic control is to be used with population-based 
case-control studies in a two-stage design, then prob- 
lems can arise if the subjects in the two stages are 
derived from different populations. One approach is 
to estimate kinship using the available data from 
the different stages (a high-density chip for stage I, 
just the selected SNPs and perhaps some additional 
ancestry-informative markers in stage II). 

2.2.4 More than two stages? In principle, there is 
no reason why the two-stage design described above 
could not be extended to a multistage design, with 
successively smaller proportions of SNPs being tested 
in new samples at each subsequent stage. Indeed, 
some of the earliest studies were conducted in just 
that manner (Hirschhorn and Daly, 2005). Multi- 
ple stages would have the practical effect of reduc- 
ing the genotyping cost ratio between the first stage 
and the combined later stages, perhaps by a sig- 
nificant factor. Inclusion of additional stages would 
be most cost effective when the genotyping cost ra- 
tio is 1 between the platforms used in the second 
and later stages (Kraft et al., 2008). The additional 
complexities in both design optimization and final 
significance testing of results have yet to be fully 
explored, however. 

2.3 Methods of Significance Testing for 
Two-Stage Designs 

Two-stage designs pose special challenges to sig- 
nificance testing in the final analysis of the combined 
data. The basic p- value to be computed is the prob- 
ability that a given SNP would have been deemed 
"promising" at the first stage and that the combined 
data would show significance at a genome-wide level 
given that it was selected for testing in the second 
stage, under the null hypothesis that it is not associ- 
ated with disease. The fact that two "hurdles" have 
to be crossed for each "significant" result means that 
the p- value of interest is actually somewhat smaller 
than the "nominal" p-value based on analyzing the 
combined data. The various two-stage design pa- 
pers discussed earlier have shown how to compute 
this probability under simplifying assumptions and 
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thereby optimize the design, but these approxima- 
tions can often be improved upon in analyses of 
real data. Among other assumptions is that of inde- 
pendence across SNPs, which is necessary to derive 
the appropriate cutoff for genome-wide significance. 
An obvious way to avoid having to make such as- 
sumptions is some form of a permutation test. For a 
single-stage design, this is straightforward: one could 
simply hold the genotypes fixed (thereby maintain- 
ing their LD structure) and randomly permute the 
phenotypes in a standard case-control design (or 
analogous methods for family-based studies, based 
on within- family permutation). In a two-stage de- 
sign, this is not so straightforward, however, as one 
must permute the entire analysis; but a random per- 
mutation of the stage-one data would yield a differ- 
ent set of SNPs to be tested in stage II and these 
genotypes are not available for permuting! 

Two methods have been proposed to assess signif- 
icance in two-stage studies. They both make clever 
use of the fully genotyped stage I subjects to mimic 
the effect of having two stages of genotyping. Both 
require large numbers of subjects in stages I and II, 
an assumption that would be usually met for a well- 
powered GWAS. In Dudbridge's (2006) method, a 
permutation null distribution is computed by per- 
forming the full two-stage analysis on a large num- 
ber of permuted datasets in which a subsample of 
the stage I subjects plays the role of the stage I 
sample, and the original stage I subjects play the 
role of the combined stage I and stage II samples. 
The method is valid under exchangeability of the 
stage I and stage II samples, provided the permuta- 
tion distribution "stabilizes" for large samples. The 
Monte Carlo method of Lin (2006) relies on the fact 
that the efficient scores functions have, jointly for 
all tests in stages I and II, a mean-zero asymptotic 
multivariate normal distribution under the complete 
null hypothesis, and that all score, Wald or likeli- 
hood ratio test statistics commonly used to test sin- 
gle SNPs or haplotypes are asymptotically equiva- 
lent to simple chi-square statistics based on the ef- 
ficient score functions. Assuming that the subjects 
are randomly chosen for stages I and II, the asymp- 
totic variance matrix of the efficient scores can be 
estimated based on the observed efficient score func- 
tions for stage I only. Monte Carlo replicates can 
then be efficiently drawn from the estimated asymp- 
totic multivariate normal distribution of the effi- 
cient scores, and the chi-square statistics equiva- 
lent to the original tests computed for each Monte 



Carlo replicate. Adjusted p- values can be computed 
based on the Monte Carlo replicates. An advantage 
of Lin's method is that it does not require recalcu- 
lation for each Monte Carlo replicate of the original 
tests statistics that can be computationally costly, 
but only for the simpler equivalent chi-square tests 
based on the efficient scores. This can result in sig- 
nificant time savings. Both Lin's and Dudbridge's 
method can be extended to two-stage family-based 
GWAS but not to studies using case-control samples 
in stage I and families in stage II. 

Methods based on Bonferroni correction using an 
"effective number of tests" (see Section 2.2.2) for 
a given platform in a single-stage design have typ- 
ically relied on permutation tests applied to data 
sets where very large numbers of SNPs are geno- 
typed in relatively small numbers of subjects (e.g., 
the HapMap). Just as for the methods described 
above, there is an implicit assumption in these cal- 
culations that the null distribution of the minimum 
p-value for a group of tests does not depend very 
strongly on the number of subjects in the analysis 
but only on the LD pattern between the tests con- 
sidered. 

The entire subject of adjustment for multiple com- 
parisons is rapidly evolving. For a recently proposed 
method and a review of other methods, see Han, 
Kang and Eskin (2009). 

2.4 Selection of SNPs for the Next Stage 

Another decision entails the selection of SNPs to 
be carried from stage I to stage II or to be reported 
as "significant" at the end of the study. Of course the 
true causal association may not lie anywhere near 
the top of the distribution of p-values (Zaykin and 
Zhivotovsky, 2005). Furthermore, if the distribution 
includes some false positives due to bias (e.g., dif- 
ferential genotyping error), then the most significant 
findings are more likely to be false positives. 

Most of the literature has assumed that p-values 
for single SNP associations will be used for select- 
ing SNPs to carry forward, although alternatives 
have been suggested, including the population at- 
tributable risk (Hunter and Kraft, 2007), the False 
Positive Report Probability (Hunter and Kraft, 2007; 
Samani et al., 2007), Bayes factors or q- values (Wake- 
field, 2008), empirical Bayes estimates of effect size 
(Hunter and Kraft, 2007) or multimarker methods 
like a local scan statistic (Guedj et al., 2006). But 
such approaches make no use of any external infor- 
mation that might suggest that some associations 
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were more credible than others a priori. For exam- 
ple, one might wish to give greater credence to asso- 
ciations with SNPs located in or near genes (partic- 
ularly those that may have a high prior probability 
of involvement in the disease) or highly conserved 
regions of the genome, coding SNPs, those located 
under a linkage peak, or those with previously re- 
ported associations. Often such information is used 
informally at the conclusion of a GWAS in decid- 
ing which associations to pursue with further fine 
mapping or functional studies. 

Roeder et al. (2006) and Roeder, Devlin and 
Wasserman (2007) have proposed a weighted False 
Discovery Rate framework and Bayesian versions 
have been proposed by Whittemore (2007) and Wake- 
field (2007). All of these allow a specific variable to 
be used to up- or down-weight the significance as- 
signed to each association. They showed that well 
chosen prior information can substantially improve 
the power for detecting true associations, while there 
was relatively little loss of power if that information 
is uninformative. 

Each of these approaches allows only a single vari- 
able to be incorporated, with weights specified in ad- 
vance. Hierarchical modeling approaches (Chen and 
Witte, 2007; Lewinger et al., 2007a) allow multiple 
sources of information to be empirically weighted 
in models for the probability that an association is 
null and the expectation of the magnitude of an as- 
sociation given that it is not null. Simulation stud- 
ies (Lewinger et al., 2007a) showed that when there 
was little or no useful prior knowledge, the standard 
p-value ranking performed best, but when at least 
some of the available covariates were strongly pre- 
dictive (even if one did not know which ones were 
truly predictive), the hierarchical Bayes ranking led 
to better power. For further discussion, see the pa- 
per by Pfeiffer, Gail and Pee (2009) in this issue. 

2.5 DNA Pooling 

DNA pooling offers another approach that could 
drastically reduce the cost of genotyping for a GWAS. 
While the idea has been around for some time (Bansal 
et al., 2002; Risch and Teng, 1998), the technical 
challenges in forming comparable pools and quan- 
tifying allele frequencies are formidable (Barratt et 
al., 2002; Feng, Prentice and Srivastava, 2004; Pfeif- 
fer et al., 2002; Sham et al., 2002; Zou and Zhao, 
2004). It is only recently that it has proved feasi- 
ble to apply this technique to high-density genotyp- 
ing arrays (Craig et al., 2005; Docherty et al., 2007; 



Johnson, 2007; Meaburn et al., 2006; Sebastiani et 
al., 2008; Zuo, Zou and Zhao, 2006). As currently 
employed, the design generally entails forming sev- 
eral small pools of cases and of controls in stage I 
and selecting SNPs on the basis of their differences 
in allele frequencies. These are then retested by in- 
dividual genotyping in stage II, possibly on both the 
original and a second sample. Much remains to be 
done to study the best choices of design parameters 
(numbers of pools, sample sizes, criteria for selecting 
SNPs to test by individual genotyping, etc.) (Mac- 
gregor, 2007) and to estimate the statistical power 
and false discovery rate for this approach in practice. 
However, empirical applications have demonstrated 
that DNA pooling is capable of detecting several 
associations that have previously been discovered 
and confirmed by individual genotyping in a GWAS 
context (Pearson et al., 2007). Furthermore, several 
studies using this approach have reported novel as- 
sociations (Kirov et al., 2009; Spinola et al., 2007; 
Steer et al., 2007), although it remains for these as- 
sociations to be confirmed independently. 

Several recent technological advances offer the po- 
tential to greatly improve the utility of DNA pool- 
ing. The first entails molecular "bar coding" of the 
individual DNA molecules contributing to each pool 
(Craig et al., 2008), so that the genotypes of the spe- 
cific individuals contributing to the subset of pools 
found to contain rare variants in excess in case pools 
compared to control pools can be readily reconstructed 
without the need for further genotyping. The sec- 
ond development entails the use of "pools of pools" 
to dramatically reduce the cost, so that it now be- 
comes feasible to obtain DNA sequence informa- 
tion on pools as large as 3000 (D. Duggan, TGen, 
personal communication). We will revisit the use 
of multistage designs using pooled DNA for deep- 
resequencing in the concluding section. 

2.6 Multistage Designs for Testing Main Effects 
and Interactions 

The NIH "Genes and Environment Initiative" has 
focused attention on the use of GWAS for identify- 
ing genes that modify the effects of environmental 
agents (Kraft et al., 2007). Such studies pose ad- 
ditional methodological problems, beyond the usual 
challenges in assessing the main effects of genes and 
environmental factors, such as low power (Gauder- 
man, 2002) (for further discussion, see the paper 
by Kooperberg et al., 2009 in this issue). However, 
there is the opportunity to improve power by using 
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a case-only design (Piegorsch, Weinberg and Taylor, 
1994) in which G x E interaction is tested by testing 
for association between a gene and environmental 
factor among cases, under the assumption that this 
association does not exist in the general population. 
Such an assumption is not likely to hold for all possi- 
ble SNP x E interactions in a GWAS, but testing this 
assumption first in controls and deciding whether 
to perform a case-only or conventional case-control 
test accordingly can lead to substantial inflation of 
type I error rates (Albert et al., 2001). Nevertheless, 
more appropriate methods for combining the infer- 
ences from case-control and case-only analyses of 
the same data have been described (Chatterjee and 
Carroll, 2005; Chatterjee, Kalaylioglu and Carroll, 
2005; Cheng, 2006; Mukherjee et al., 2007; Mukher- 
jee et al., 2008; Mukherjee and Chatterjee, 2008). 
For example, Mukherjee and Chatterjee (2008) use 
an empirical Bayes compromise between the case- 
only and case-control estimators, weighted by the 
estimated probability of the existence of a G— E as- 
sociation. Rather than limiting the analysis to an 
all-or-nothing choice between case-only and case- 
control approaches, these methods have the advan- 
tage of letting the data and a prior estimate the most 
appropriate weight between models. In the case of 
SNP x SNP interactions, one may use LD informa- 
tion from HapMap to generate flexible priors that 
can greatly increase power (Li and Conti, 2009). In 
the context of a GWAS, various multistage designs 
are possible, such as using a case-only test in the 
combined sample of cases and controls to screen in- 
teraction effects and then confirming that subset by 
a standard case-control test in the same data set 
(Murcray, Lewinger and Gauderman, 2009). This 
design has been shown to be substantially more effi- 
cient than a single-stage scan using a standard case- 
control comparison. 

3. SINGLE VS. TWO-STAGE DESIGNS 

As the cost of commercial chips falls relative to 
custom genotyping, the merits of this approach will 
need to be reconsidered (Hunter et al., 2007). As 
mentioned above, faced with a choice between den- 
sity of SNPs and sample size in a single stage study, 
it is usually preferable to have the largest possi- 
ble sample size, even if this means not being able 
to afford a higher density chip. A two-stage design 
may, however, allow a higher density chip to be 
used in stage I than would be affordable in a single- 
stage design, and hence improve power for regions 



of low LD and overall mean power (Lewinger et al., 
2007b). The ability to combine different study de- 
signs (e.g., population-based and family-based) may 
also favor a two-stage design. Other considerations, 
however, may favor a one-stage design, such as faster 
study completion and simplified logistics and quality 
control due to use of a single genotyping platform. 
Additionally, multiple hypotheses can be tested us- 
ing these data, say, multiple phenotypes in a co- 
hort design or various subgroup analyses or interac- 
tion tests. For example, in addition to scanning for 
genetic main effects, the Southern California Chil- 
dren's Health Study (CHS) of the health effects of 
air pollution aimed to identify genes that interact 
with two measures of air pollution, exposure to traf- 
fic, in utero, and second-hand tobacco smoke, and 
GSTM1 (previously shown to be involved in several 
G x E interactions) or to differ between Hispanic and 
non-Hispanic children, each of these for two pheno- 
types, asthma and lung function development. SNPs 
might be selected from the initial scan for follow-up 
based on any of these criteria. In order to have rea- 
sonable power for detecting each of these effects, a 
custom panel of 12K markers or more would have 
been required, the cost of which begins to approach 
that of simply using the same high density panel as 
in the initial scan, so the decision was made to do a 
one-stage scan instead. In fact, in the NHLBLfunded 
STAMPEED consortium of GWAS for cardiovascu- 
lar, lung and blood disorders of which the CHS is a 
part (http:/ /public. nhlbi.nih.gov/GeneticsGenomics/ 
home/stampeed.aspx), most of the 13 participat- 
ing centers are using a one-stage design. In a one- 
stage design, replication of SNPs attaining genome- 
wide levels of significance is still needed, as discussed 
below. However, the combination of discovery and 
replication phases should not be regarded as a for- 
mal two-stage design, which we define as involving 
the testing of a large number of hits in a second 
stage and doing a joint analysis of both. This may 
involve optimizing the choice of sample sizes and 
significance levels as discussed earlier, but these two 
stages combined have the same goal as a 1-stage de- 
sign, namely, discovery. 

4. AFTER GWAS, WHAT NEXT? 

Multistage sampling designs for GWAS should not 
be thought of as a hypothesis generation followed by 
independent replication approach but rather as sim- 
ply a more cost-efficient way of conducting the dis- 
covery approach (Skol et al., 2006). Nevertheless, it 
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must be appreciated that any effect estimates (e.g., 
odd ratios) surviving the entire discovery process 
will tend to be biased away from the null because at- 
tention is focused only on those that are statistically 
significant, a phenomenon known as the "winner's 
curse" (Kraft, 2008; Yu et al, 2007; Zhong and Pren- 
tice, 2008; Zollner and Pritchard, 2007). Thus, some 
form of truly independent replication is needed, both 
to confirm the existence of the reported associations 
and to estimate the magnitude of their effect. In 
the following sections we distinguish between what 
we will call "exact replication" and "generalization," 
the latter being aimed at determining the full extent 
of a replicated association across populations, phe- 
notypes, modifiers, etc. In addition, an association 
initially reported may not be with the causal variant 
itself, but rather with some other variant it is in LD 
with, so further studies aimed at fine mapping or 
resequencing the region to identify the culprit may 
be needed. Finally, once plausible candidates for the 
causal variants have been identified, there is a need 
for further experimental studies to understand their 
biological function and additional in silico and epi- 
demiologic analyses to build a comprehensive model 
for the causal pathway. 

4.1 Replication 

Failure to replicate has been a recurring problem 
with candidate gene association studies, hence a ma- 
jor concern about the new generation of GWASs 
(Chanock et al., 2007; Ioannidis, 2007). (The com- 
panion paper by Kraft, Zeggini and Ioannidis, 2009 
in this issue explores the replication issues in greater 
depth.) True scientific replication must involve some- 
thing more than a repetition of the study on a sec- 
ond random sampling from the same population us- 
ing the same methods (Chanock et al., 2007; Clarke 
et al., 2007), since simply splitting a sample in half 
and requiring significance at level a in both halves 
is less powerful than a single analysis of the entire 
sample at significance level a 2 (Skol et al., 2006; 
Thomas et al., 1985). Nevertheless, the goal at this 
stage should be to avoid failure to replicate because 
of true differences in effect between the original and 
follow-up populations, investigation of real hetero- 
geneity being the subject of the next stage ( "general- 
ization" ) . Many granting agencies now expect inves- 
tigators to discuss plans for follow-up investigations 
of any associations detected and some high profile 
journals are requiring replication studies as part of 
a single report of a genetic association (Anonymous, 



1999; Rebbeck et al., 2004). In many cases, this 
might best be accomplished by collaborations with 
other groups with data on a genetically similar set 
of subjects. Failure to replicate may often be due to 
the use of replication data sets that were not well 
designed for this purpose because of heterogeneity 
between the original and replicate data sets or prob- 
lematic study designs that were generated for differ- 
ent purposes originally. Replication and generaliz- 
ability are often muddled together even though they 
are two different questions that are best addressed 
with different types of study populations — one se- 
lected to minimize heterogeneity and the others se- 
lected to maximize it. 

One question that frequently arises is whether to 
restrict replication claims to the same marker de- 
tected in the initial GWA scan ( "exact" replication) 
or to test additional markers in the region and al- 
low association with any of them (appropriately ad- 
justed for multiple comparisons) to be treated as 
evidence of replication ("local" replication) (Clarke 
et al., 2007). In a similar vein, associations first dis- 
covered in a GWAS by imputed SNPs should be 
confirmed by direct genotyping, either in the orig- 
inal samples, or better in independent replication 
samples, before a genuine association is claimed. In 
any event, a clear definition of replication is needed: 
generally this is taken to be a statistically signifi- 
cant association in the same direction, but now not 
requiring genome-wide multiple testing correction 
since only a subset of the top-ranking associations 
will be subject to replication and the magnitude of 
the original relative risk is likely to have been over- 
estimated. 

4.2 Generalization 

Once an association has been replicated, it be- 
comes important to investigate the full range of its 
effects. For example, one of the first questions to 
address is whether the effect differs across races. 
If so, this could be a sign that the association is 
not causal, but only a reflection of a causal effect 
of some other variant with which it is in LD, the 
patterns of LD differing across races, and would 
suggest that further fine mapping of the region is 
warranted. Furthermore, if there is heterogeneity by 
race/ethnicity, fine mapping within a race that ex- 
hibits the association of interest but has shorter 
LD blocks would help localize the signal more effi- 
ciently than in a race with longer LD blocks. Alter- 
natively, heterogeneity by race could be a reflection 
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of differences across races in the prevalence of some 
modifying factors — G x E or G x G interactions — 
indicating that further investigation of effect modifi- 
cation is warranted. Beyond the question of hetero- 
geneity by race/ethnicity, there are other questions 
of generalizability worth considering. Does the vari- 
ant have similar effects across different subtypes of 
the same disease (for example, for colorectal cancer, 
by location in the colon, age of onset, family history 
of colorectal or other cancers, presence or absence 
of selected molecular markers such as microsatellite 
instability, BRAF mutation, MLH1 methylation)? 
Does the variant have effects on other phenotypes — 
intermediate endpoints like incidence or recurrence 
of polyps for colorectal cancer or other cancer sites 
or even other diseases with which it might share a 
common etiologic pathway? After a result is con- 
firmed in a properly designed replication study, we 
would advocate a strategic approach to the question 
of generalizability, guided by a careful consideration 
of the most important knowledge gaps about the dis- 
ease, rather than the sometimes uncritical exercise 
of quickly testing for the reported SNP in whatever 
data sets are readily available (ignoring whether the 
result would fill an important gap in our knowledge 
base). 

4.3 Fine-Mapping and Deep Resequencing 

Unless there is compelling evidence that a newly 
discovered association with a particular SNP is in- 
deed causal, further fine mapping of the surrounding 
region is generally appropriate, given that the SNPs 
on the discovery panel represent at most 20% of all 
common variants and were selected primarily for 
their effectiveness at tagging other variants rather 
than as biologically plausible candidates themselves. 
Furthermore, it is becoming increasingly evident that 
multiple rare variants may play an important role 
in many diseases (Fearnhead et al., 2004; Iyengar 
and Elston, 2007; Kryukov, Pennacchio and Sun- 
yaev, 2007; Li and Leal, 2008; Pritchard, 2001). This 
search for the true culprit (s) — which could occur be- 
fore, after or concurrent with the generalization ac- 
tivities described above — might involve some com- 
bination of fine mapping with additional SNPs and 
deep resequencing and poses interesting challenges 
for study design, particularly in terms of the balance 
between additional efforts to fine-map a signal with 
additional genotyping of previously known variants 
versus jumping directly to deep resequencing for dis- 
covery. 



Fine mapping might explore a relatively large re- 
gion surrounding the associated SNP(s) and be in- 
formed by knowledge from HapMap of the LD struc- 
ture of the region. The goal would be to genotype a 
denser set of tag SNPs than was possible in the ini- 
tial GWAS in order to conduct haplotype or multi- 
marker association tests, as discussed elsewhere in 
this issue (Su et al, 2009; Goddard et al., 2009; 
Zollner and Teslovich, 2009). (For this purpose, one 
might wish to use a different population, such as 
those of African descent where LD blocks would 
tend to be shorter.) Deep resequencing would entail 
selection of a subset of participants from the main 
study for complete sequencing of the region to search 
for other relatively common variants (~l-5%) that 
may not have been characterized by HapMap. These 
variants would then be genotyped in the entire study 
sample to test for association. Because of the high 
cost of sequencing, it might be advisable to do the 
fine mapping first to narrow down the region of in- 
terest, but with the advent of next generation se- 
quencing, DNA "bar coding" and DNA pooling meth- 
ods (Craig et al., 2008), costs are coming down so 
rapidly that one might want to proceed directly to 
sequencing. Either approach might benefit from a 
formal two-stage sampling design, although the cost 
savings are likely to be more substantial for deep 
resequencing studies. 

4.3.1 Two-stage designs revisited The basic idea 
here would be to select a subset of subjects for addi- 
tional genotyping and/or sequencing who would be 
most likely to carry a causal variant. This subset of 
subjects serves two general goals. The first may sim- 
ply be to characterize the genetic variation or to dis- 
cover previously unknown variants within the region 
to then genotype in the larger main study with a 
more cost-effective genotyping technology. A second 
goal may be to formally combine the more detailed 
information for the subgroup with the data from the 
main study on only the selected SNPs. As mentioned 
at the outset, this idea has been extensively devel- 
oped in the literature on "two-phase sampling" in 
survey design and more recently in epidemiologic ap- 
plications. Unlike the two-stage designs for GWAS 
described above, these designs typically entail using 
information that is readily available on the entire 
case-control study to select a stratified subsample. 
For the first goal of characterization only, a sam- 
ple of only cases may be most efficient (see below). 
However, if one wishes to sample the subset for more 
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detailed measurements and then combine the two 
data sets in a joint analysis, one may need to sam- 
ple both cases and controls. Here, one needs to con- 
sider both the optimization of the informativeness 
of the subset for discovery as well as informative- 
ness for the ultimate case-control analysis that takes 
the sampling fractions into account. In the present 
context, the relevant stratifying variables might in- 
clude case/control status and the SNP genotypes 
(and possibly exposure variables if detected through 
a G x E interaction effect). Such an approach has 
been explored for candidate gene association stud- 
ies (Thomas, Xie and Gebregziabher, 2004), where 
information on a dense panel of SNPs in a targeted 
subsample is combined with a sparser panel from 
the main study for the purpose of localizing the sig- 
nal by LD mapping or for testing haplotype associa- 
tions. Here, each region identified in a GWAS would 
likely target a different subsample of subjects, based 
on the available SNPs in that region. 

A typical study might involve sequencing a sam- 
ple of about 48 or 96 individuals over perhaps a 100 
Kb region. Assuming that the region size has al- 
ready been established based on the pattern of SNP 
or haplotype associations from the initial GWAS, 
knowledge of the LD structure of the region and 
possibly additional fine-mapping, how then should 
this relatively small sample be selected to maximize 
the chances of discovering the real causal variant (s)? 

Suppose first that a positive association has been 
found with a single SNP (Table 1, top). If not itself 
causal, this could theoretically reflect either a dele- 
terious effect of another variant in positive LD with 
it or a protective effect of a variant in negative LD. 
Of the two possibilities, the former is much more 
likely, as negative LD with a protective minor allele 
is unlikely to generate a large positive association 
at a marker locus (see the second block in Table 1, 
where a perfectly protective allele in perfect negative 
LD yields a marker RR of only 1.067). The subjects 
with the highest yield of causal variants would then 
be cases carrying the minor allele of the associated 
SNP, with carrier controls somewhat lower but still 
much higher than either cases or controls carrying 
the major allele. 

Now suppose instead that the minor allele shows 
a negative association with disease. Again, if not 
itself causal, this could reflect either a protective 
effect of a variant in positive LD with it or a dele- 
terious effect of a variant in negative LD, these two 
scenarios now being roughly equally plausible (see 



bottom two blocks of Table 1, where both configu- 
rations yield similar marker RRs). In this case, the 
most informative subjects would be cases carrying 
the major allele or controls carrying the minor al- 
lele at the associated SNP, with the latter generally 
having a higher yield of causal variants. 

To summarize the situation with a single marker, 
if one is purely interested in maximizing the chances 
of identifying a causal variant that will then be geno- 
typed in the entire sample, then one could sample 
only carriers of the minor allele — cases if the marker 
association is positive, controls if it is negative. No 
weighting would be required for the analysis of the 
full study data for the discovered genotypes. If, on 
the other hand, one wishes to perform a joint anal- 
ysis of the main and substudy data incorporating 
the full sequence data on substudy subjects, then 
to be able to weight the analysis correctly, all four 
strata must be represented. The optimal sampling 
fractions would depend upon knowledge of the true 
LD and causal association parameters, but one could 
be guided by the general calculations illustrated in 
Table 1: if the association with the minor allele is 
positive, then sample the largest number of cases 
with the minor allele, then controls with the mi- 
nor allele, and the smallest number of carriers of 
the major allele; if the association is negative, then 
sample most heavily controls with the minor allele, 
then equal numbers of cases with the minor allele 
and controls with the major allele, and the smallest 
number of controls with the minor allele. 

Now, suppose the association is not just with a 
single SNP in a region but with several. A sensi- 
ble sampling design might now entail first construct- 
ing a risk index, say, by logistic regression of case- 
control status on multiple SNPs or haplotypes and 
then stratifying jointly on this genetic risk index and 
case-control status. The concept of positive or neg- 
ative association and LD is now moot and needs 
to be replaced by consideration of the shape of the 
distribution of the risk index (Figure 1). Typically, 
one might find a relatively small proportion of sub- 
jects with a broad range of high risk scores and a 
large proportion with generally low risk (Thomas et 
al., 2008). In this situation, it would be cases with 
high risk scores that are likely to be the most in- 
formative, although controls with high risk scores 
would also have an increased probability of carrying 
a causal allele. In the event that the risk distribu- 
tion has a long tail of low risk, it could be worth- 
while to sample controls with low risk scores, but 
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Table 1 

Illustrative calculation of probability of carrying a rare causal variant G among cases and controls carrying the major or 
minor allele at a marker locus M in LD with it: Pr(Af ) = 0.2, Pr(G) = 0.05. Bolded entries indicate the highest yield strata 

in each situation 



Marker 
M 



Disease 
Y 



Causal allele 



G = 



G = l 



Pr(G = l\M,Y) 



Positive marker association 
Positive LD and positive causal association 
S = 0.036, RRyg = 2, RRym = 1.22 
m 



M 



Negative LD and negative causal association 
8 = -0.010, RRyg = 0, RRym = 1.067 
m 



M 



Negative marker association 
Negative LD and positive causal association 
8 = -0.010, RRyg = 3, RRym = 0.889 
m 



M 



Positive LD and negative causal association 
8 = 0.036, RRyg = 0.5, RRym = 0.887 
m 



M 



Controls 
Cases 

Controls 
Cases 



Controls 
Cases 

Controls 
Cases 



Controls 
Cases 

Controls 
Cases 



Controls 
Cases 

Controls 
Cases 



0.796 
0.758 
0.154 
0.147 



0.750 
0.789 
0.200 
0.211 



0.750 
0.682 
0.200 
0.186 



0.796 
0.816 
0.200 
0.211 



0.004 
0.008 
0.046 
0.088 



0.050 
0.000 
0.000 
0.000 



0.050 
0.136 
0.000 
0.000 



0.004 
0.002 
0.046 
0.024 



0.005 
0.010 
0.230 
0.374 



0.063 

0.000 
0.000 
0.000 



0.063 
0.136 

0.000 
0.000 



0.005 
0.003 
0.230 
0.130 



detecting effects of rare beneficial alleles would re- 
quire enormous sample sizes. (Of course, a beneficial 
effect of a common allele is equivalent to a deleteri- 
ous effect of a rare allele.) As in the single-marker 
case, if the purpose is simply to discover potentially 
causal variants, then one could restrict the sample 
to high-risk cases, but if a joint analysis is planned, 
then a well-defined sampling scheme is required that 
assigns nonzero sampling probability to every indi- 
vidual. This could be accomplished by stratifying 
jointly on Y and Pr(y = 1|M) or in proportion to 
an estimated Pr(G? = 1\Y, M) for some hypothesized 
model. 

It must be appreciated that only very strong as- 
sociations would have any power for testing asso- 
ciations in the subsample alone. The real purpose 
is simply to identify novel variants that would then 
be genotyped in the main study. Having completed 
the sequencing of this stratified sample and geno- 
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Fig. 1. Illustration of a hypothetical distribution of an em- 
pirical risk index (M , right axis, dotted line) and yield of 
causal variant carriers (D = 1, left axis, solid and dashed lines 
for cases and controls respectively) as a function of M . 



typing of selected novel variants in the main study, 
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a joint analysis could be performed as described by 
Thomas, Xie and Gebregziabher (2004) to test for 
associations with all variants discovered in the rese- 
quencing sample, not just those actually genotyped 
in the main study. This essentially involves impu- 
tation of the missing data on main study subjects 
using the substudy data, but requires appropriate 
adjustment for the sampling fractions if they depend 
jointly on genotypes and disease, so that all cells of 
the stratification must be represented in the sample. 
For substudy subjects, the standard logistic model 
can be used by adding as an offset term the log of 
the ratio of genotype-specific case/control sampling 
probabilities. For main study subjects, the likeli- 
hood contribution becomes a more complex mixture 
of weighted logistic probabilities, although well ap- 
proximated by a logistic function if the disease is 
rare. Of course, imputation of very rare variants in 
the main study by leveraging a substudy with only a 
few occurrences of such variants is of dubious value, 
so all potentially causal variants should be geno- 
typed in the full sample, but imputation could be 
useful for exploiting LD patterns in the sequence 
data that might suggest regions worth closer study. 

These considerations are likely to be fundamen- 
tally altered in the near future by the public avail- 
ability of resequencing data from the "1000 Genomes 
Project" (http://www.1000genomes.org/), aimed 
at identifying variants at a frequency of 1% across 
the genome in approximately 1500 subjects (500 
Bantu-speaking, 500 Asian, 500 Caucasian). Data 
at an intermediate level of detail (e.g., deep rese- 
quencing of 1000 genes in 1000 individuals) will be 
released soon. Once completed, the 1000 Genomes 
Project will potentially reduce the need for extensive 
deep resequencing for variants in the 1—5% range, 
at least for studies conducted in comparable popula- 
tions, but would still leave open the question of rarer 
variants. Methods to leverage the 1000 Genomes 
data for imputation purposes or joint analysis with 
a two-phase sampling design, allowing for possible 
misspecification of the G-M relationships for the 
specific study's target population, would be useful. 

The advent of whole genome sequence using next- 
generation sequencing platforms (Mardis, 2008) may 
also resurrect interest in multistage designs, as ge- 
nome-wide scans for rare variants are unlikely to be 
feasible on the tens of thousands of subjects that will 
be needed, at least until the $1000 genome becomes 
a reality. Whether this will ever be feasible, given the 
much larger multiple-testing burden, the sparseness 



of data on any particular variant and the likelihood 
that rare variants will be less effectively tagged than 
common variants, remains to be seen. Nevertheless, 
these technologies will undoubtedly aid in follow- 
ing up larger and larger regions surrounding SNPs 
identified in a GWAS on larger and larger samples, 
requiring adjustment for a much larger universe of 
variants, rare and common (Hoggart et al., 2008a, 
2008b). 

4.4 Investigating Biological Function, eQTLs 
and Pathways 

Once a set of highly significant and replicated SNP 
associations has been found, what then? The chal- 
lenges posed by the study of the often broadly di- 
verse biologic functions of hits arising from GWASs 
should not be underestimated. Trying to determine 
the functionality of even a single GWAS hit can be 
a daunting task. While clearly in vitro and in vivo 
experimental studies would be appropriate to inves- 
tigate function, the initial steps in characterization 
depend upon whether hits are located within genes, 
near known genes or in "gene deserts." If the hit 
is within a gene, various software packages and web 
sites could be used to assess the potential functional 
role of the variant, and such in silico findings could 
then be confirmed in vitro by molecular approaches 
such as quantitative RT-PCR. If the hit is near a 
gene, it could implicate an adjacent gene, but it 
could also lie in an unannotated gene, an miRNA, 
or an enhancer or repressor element for some gene 
located far away (the most likely explanation if it is 
in a gene desert). Unannotated genes might be iden- 
tified through tiling gene expression arrays, while in 
silico and ChlP-chip methods might be used to iden- 
tify enhancer /repressor elements. 

Other types of analyses also might be undertaken 
that would involve more sophisticated analyses of 
the GWAS data, either (1) in an attempt to in- 
fer causal pathways from the pattern of associations 
and interactions using the kinds of network analysis 
tools that have been applied to gene expression and 
protein interaction data, or (2) to inform the search 
for effects in the GWAS data by incorporating exter- 
nal knowledge from pathway or genomic databases, 
literature mining or analysis of gene expression, pro- 
teomic, metabolomic or other — omics data, perhaps 
using hierarchical modeling or gene set enrichment 
analysis methods. See Chasman (2008), Gieger et al. 
(2008), Pan (2005) and Wang, Li and Bucan (2007) 
for discussion of some of these approaches. 
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A second phase of these studies could be to exam- 
ine any known biological functionality of the gene in 
question, and once again those applied approaches 
will depend upon several considerations such as the 
likely consequence of the SNP itself upon gene func- 
tion as well as prior knowledge of the gene and un- 
derstanding of its involvement, if known, in cellu- 
lar pathway (s). Gene expression data (even genome- 
wide data) might be leveraged to identify candidate 
genes/pathways. A SNP that lies within the coding 
region of a gene may be more likely to affect the nor- 
mal function of that gene, either through enhancing 
its effect or reducing its functionality. Biochemical 
assays may be available that could be applied to test 
the effect of a coding region variant on its known 
gene function, such as a role in apoptosis. Alterna- 
tively, where a hit lies adjacent to a gene or within a 
gene desert and possibly in an enhancer element or 
other regulatory region, the likely effect may be on 
the expression level of the gene — whether leading to 
higher or lower gene expression levels, mRNA sta- 
bility or post-translational protein levels in target 
tissues. The next steps of characterization would re- 
quire using knowledge of that gene and related path- 
ways to develop assays that would test the putative 
consequences of either elevated or reduced expres- 
sion of the gene product in appropriate cells. For 
some genes, accumulated knowledge of its role in 
the cell may be extensive; however, for others that 
knowledge may be sparse or even non-existent. Such 
prior knowledge may be used to help prioritize func- 
tional biological studies. However, they could have 
the effect of steering us away from further charac- 
terization of potentially interesting genes that have 
little prior biological knowledge due to the greater 
challenges that they pose. 

Given the potential complexity and diversity of 
methods that will need to be applied to follow up 
on any identified hit, a prioritization scheme will 
need to be developed that will likely involve many 
different considerations, such as the strength of the 
hit itself, whether the hit has any implications for 
disease subsets such as more aggressive forms of can- 
cer, whether the hit is potentially implicated in more 
than one disease (such as appears to be the case 
for the 8q24 region, which is related to at least 4 
cancers, and for several diabetes risk alleles that are 
found to be protective for prostate cancer) and prior 
biological knowledge. The hierarchical modeling ap- 
proaches discussed in Section 2.4 may be helpful for 
combining the evidence from the data at hand and 



these external sources of knowledge to prioritize hits 
for follow-up functional studies. None of these kinds 
of studies would be likely to involve the original epi- 
demiological study subjects, however, and further 
detailed investigations are likely to be gene-specific, 
so are beyond the scope of this article. 

Two-phase sampling designs may be particularly 
helpful for biomarker or expression measurements 
to inform the analysis of pathways. Such analyses 
may take the form of a network of latent variables, 
for which the biomarkers are viewed as surrogate 
measurements (Thomas, 2007). In such designs, one 
might wish to subsample jointly on some combina- 
tion of disease, exposure and genotype(s) to select 
individuals for biomarker measurements. For exam- 
ple, in a pharmacogenetic study, one might subsam- 
ple on the basis of outcomes and treatment assign- 
ment to target a GWAS or a resequencing study of 
a candidate region; or if GWAS data were already 
available, one might stratify by a multi-marker risk 
score, treatment and outcomes for a collection of 
biomarkers to investigate pathways (Thomas and 
Conti, 2007). Any study of biomarkers collected af- 
ter the outcome must, however, address the problem 
of "reverse causation," whereby the variable being 
measured (or the accuracy of its measurement) is 
affected by the disease or its treatment rather than 
the other way around. 

Finally, it is worth noting that an enhancement in 
our knowledge of the etiology of disease may have 
implications that transcend the merely predictive 
power of a specific variant. The relatively modest 
relative risks that have been discovered by GWASs 
for disease etiology could be due in part to selec- 
tion against high risk variants, but this is unlikely 
for response to modern pharmacologic agents. For 
example, SNPs in HMGCR have only a small effect 
on low density lipoprotein levels, but drugs targeting 
the protein encoded by HMGCR have a much larger 
effect (Altshuler, Daly and Lander, 2008). One can 
at least hope that solving the mystery of how vari- 
ants in a gene desert such as 8q24 appear to influ- 
ence the risk of a multitude of cancers would lead 
to methods aimed at preventing or treating the re- 
sulting diseases. Ultimately, the discovery of genetic 
modifiers of treatment response is central to the goal 
of personalized medicine. 
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